Titanic Survival Prediction Project
Objective:
The primary goal of this project is to predict the survival of passengers onboard the Titanic using machine learning techniques.
By analyzing and interpreting the available dataset, we aim to build a reliable model that can accurately predict whether a passenger survived or did not survive the disaster based on various features, such as age, gender, class, and more.
The insights gained from this project can help us understand the factors that contributed to a passenger's survival and provide a historical perspective on the tragedy.
Let's take a look at the variables in this dataset.
| Variable | Definition | Key | Type |
|---|---|---|---|
| survived | Survival | 0 = No, 1 = Yes | int |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd | int |
| sex | Sex | | object |
| age | Age in years | Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5 | float64 |
| sibsp | # of siblings / spouses aboard the Titanic | Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored) | int |
| parch | # of parents / children aboard the Titanic | Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, so parch = 0 for them. | int |
| ticket | Ticket number | | object |
| fare | Passenger fare | | float64 |
| cabin | Cabin number | | object |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton | object |
Now, let's start the data exploration and analysis.
### Import Libraries ###
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px
from matplotlib.cm import rainbow
import seaborn as sns
import plotly.io as pio
pio.renderers.default = 'notebook'
print('Done')
Done
train_data = pd.read_csv('train.csv') #Read 'train.csv' and 'test.csv' into a pandas dataframe
test_data = pd.read_csv('test.csv')
train_data.head(5)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
train_data.tail(5)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q |
train_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Pclass       418 non-null    int64
 2   Name         418 non-null    object
 3   Sex          418 non-null    object
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64
 6   Parch        418 non-null    int64
 7   Ticket       418 non-null    object
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object
 10  Embarked     418 non-null    object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB
train_data.isnull().sum()
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
test_data.isnull().sum()
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
# Fill the NaN values in 'Embarked' with the mode
mode_Embarked = train_data['Embarked'].mode().iloc[0]
train_data['Embarked'] = train_data['Embarked'].fillna(mode_Embarked)
Given that the 'cabin' variable has 687 missing values out of 891 rows, which corresponds to approximately 77% of the data, it can be challenging to accurately impute the missing values. In this case, it is reasonable to consider dropping the 'cabin' feature from the dataset.
However, before making a decision, it is essential to assess whether the 'cabin' variable holds any valuable information that might contribute to the prediction of passenger survival. For example, it is possible that certain cabin areas had easier access to lifeboats, which could be a significant factor in survival prediction.
# Extract the first letter of the cabin to represent the cabin section
train_data['CabinSection'] = train_data['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else x)
# Calculate the survival rate for each cabin section
cabin_survival = train_data.groupby('CabinSection')['Survived'].mean().sort_values(ascending=False)
# Plot the survival rate for each cabin section
plt.figure(figsize=(10, 5))
sns.barplot(x=cabin_survival.index, y=cabin_survival.values)
plt.xlabel('Cabin Section')
plt.ylabel('Survival Rate')
plt.title('Survival Rate by Cabin Section')
plt.show()
The distribution of survival rates across cabin sections indicates that there may be some relationship between the cabin location and the survival rate of passengers.
It seems that passengers in sections D, E, and B had higher chances of survival, while those in sections A and T had lower chances.
This information could potentially be useful for our predictive model.
Given this observed relationship, it is worth considering keeping the 'CabinSection' feature.
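Beyond eyeballing the bar chart, the association between cabin section and survival could be checked with a chi-square test of independence. A minimal sketch on toy data (in the notebook, the real call would build the crosstab from `train_data['CabinSection']` and `train_data['Survived']`):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data standing in for train_data[['CabinSection', 'Survived']]
df = pd.DataFrame({
    'CabinSection': ['B', 'B', 'C', 'C', 'Unknown', 'Unknown', 'Unknown', 'Unknown'],
    'Survived':     [1,   1,   1,   0,   0,         0,         1,         0],
})
# Contingency table of section vs. outcome, then a chi-square test of independence
table = pd.crosstab(df['CabinSection'], df['Survived'])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.3f}, dof={dof}")
```

A small p-value on the real data would support keeping 'CabinSection'; on this tiny toy table the test has little power and is only meant to show the call pattern.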
train_data['CabinSection'] = train_data['CabinSection'].fillna('Unknown')
train_data.head(5)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | CabinSection |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | Unknown |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | Unknown |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | C |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | Unknown |
# Group the data by 'Pclass' and 'Sex' and calculate the median age for each group
age_medians = train_data.groupby(['Pclass', 'Sex'])['Age'].median()
# Define a function to impute age based on 'Pclass' and 'Sex'
def impute_age(row):
if pd.isnull(row['Age']):
return age_medians[row['Pclass'], row['Sex']]
else:
return row['Age']
# Apply the function to the dataset
train_data['Age'] = train_data.apply(impute_age, axis=1)
# Group the data by 'Pclass' and 'Sex' and calculate the median age for each group
test_age_medians = test_data.groupby(['Pclass', 'Sex'])['Age'].median()
# Define a function to impute age based on 'Pclass' and 'Sex'
def impute_age(row):
if pd.isnull(row['Age']):
return test_age_medians[row['Pclass'], row['Sex']]
else:
return row['Age']
# Apply the function to the dataset
test_data['Age'] = test_data.apply(impute_age, axis=1)
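The row-wise `apply` above works, but the same group-based imputation can be written in one vectorized step with `groupby(...).transform`. A minimal sketch on toy data standing in for `train_data`:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data
df = pd.DataFrame({
    'Pclass': [1, 1, 3, 3, 3],
    'Sex': ['female', 'female', 'male', 'male', 'male'],
    'Age': [38.0, np.nan, 22.0, 30.0, np.nan],
})
# Fill each missing Age with the median of its (Pclass, Sex) group
df['Age'] = df['Age'].fillna(df.groupby(['Pclass', 'Sex'])['Age'].transform('median'))
print(df['Age'].tolist())  # [38.0, 38.0, 22.0, 30.0, 26.0]
```

This avoids defining (and redefining) an `impute_age` helper for each dataset.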
# Extract the first letter of the cabin to represent the cabin section
test_data['CabinSection'] = test_data['Cabin'].apply(lambda x: x[0] if pd.notnull(x) else x)
test_data['CabinSection'] = test_data['CabinSection'].fillna('Unknown')
train_data.hist(figsize=(10, 8))
array([[<Axes: title={'center': 'PassengerId'}>,
<Axes: title={'center': 'Survived'}>,
<Axes: title={'center': 'Pclass'}>],
[<Axes: title={'center': 'Age'}>,
<Axes: title={'center': 'SibSp'}>,
<Axes: title={'center': 'Parch'}>],
[<Axes: title={'center': 'Fare'}>, <Axes: >, <Axes: >]],
dtype=object)
# Embarked, CabinSection, Pclass, Sex vs. Survived
Key_df = train_data[['Sex', 'Embarked', 'Pclass', 'CabinSection', 'Survived']].copy()
Key_df['Survived'] = Key_df['Survived'].map({1: 'Yes', 0: 'No'})
Key_df['Pclass'] = Key_df['Pclass'].map({1: 'Upper', 2: 'Middle', 3: 'Lower'})
Key_df['Embarked'] = Key_df['Embarked'].map({'C': 'Cherbourg', 'Q': 'Queenstown', 'S': 'Southampton'})
Key_df.head()
|   | Sex | Embarked | Pclass | CabinSection | Survived |
|---|---|---|---|---|---|
| 0 | male | Southampton | Lower | Unknown | No |
| 1 | female | Cherbourg | Upper | C | Yes |
| 2 | female | Southampton | Lower | Unknown | Yes |
| 3 | female | Southampton | Upper | C | Yes |
| 4 | male | Southampton | Lower | Unknown | No |
#Survival rate group by embarked
Key_df.groupby('Embarked')['Survived'].value_counts(normalize=True).unstack().sort_values(by='Yes', ascending=False)
| Embarked | No | Yes |
|---|---|---|
| Cherbourg | 0.446429 | 0.553571 |
| Queenstown | 0.610390 | 0.389610 |
| Southampton | 0.660991 | 0.339009 |
#Survival rate group by Pclass
Key_df.groupby('Pclass')['Survived'].value_counts(normalize=True).unstack().sort_values(by='Yes', ascending=False)
| Pclass | No | Yes |
|---|---|---|
| Upper | 0.370370 | 0.629630 |
| Middle | 0.527174 | 0.472826 |
| Lower | 0.757637 | 0.242363 |
#Survival rate group by Sex
Key_df.groupby('Sex')['Survived'].value_counts(normalize=True).unstack().sort_values(by='Yes', ascending=False)
| Sex | No | Yes |
|---|---|---|
| female | 0.257962 | 0.742038 |
| male | 0.811092 | 0.188908 |
#Survival rate group by CabinSection
Key_df.groupby('CabinSection')['Survived'].value_counts(normalize=True).unstack().sort_values(by='Yes', ascending=False)
| CabinSection | No | Yes |
|---|---|---|
| D | 0.242424 | 0.757576 |
| E | 0.250000 | 0.750000 |
| B | 0.255319 | 0.744681 |
| F | 0.384615 | 0.615385 |
| C | 0.406780 | 0.593220 |
| G | 0.500000 | 0.500000 |
| A | 0.533333 | 0.466667 |
| Unknown | 0.700146 | 0.299854 |
| T | 1.000000 | NaN |
#CabinSection "unknown" count
Key_df['CabinSection'].value_counts()
CabinSection
Unknown    687
C           59
B           47
D           33
E           32
A           15
F           13
G            4
T            1
Name: count, dtype: int64
import seaborn as sns
import matplotlib.pyplot as plt
# Set the style for the plots
sns.set(style='whitegrid')
# Create a barplot for the number of survivors by 'Embarked'
plt.figure(figsize=(8, 5))
sns.countplot(x='Embarked', hue='Survived', data=Key_df)
plt.xlabel('Embarked')
plt.ylabel('Number of Passengers')
plt.title('Number of Survivors by Embarked')
plt.show()
# Create a barplot for the number of survivors by 'Pclass'
plt.figure(figsize=(8, 5))
sns.countplot(x='Pclass', hue='Survived', data=Key_df)
plt.xlabel('Pclass')
plt.ylabel('Number of Passengers')
plt.title('Number of Survivors by Pclass')
plt.show()
# Create a barplot for the number of survivors by 'Sex'
plt.figure(figsize=(8, 5))
sns.countplot(x='Sex', hue='Survived', data=Key_df)
plt.xlabel('Sex')
plt.ylabel('Number of Passengers')
plt.title('Number of Survivors by Sex')
plt.show()
# Create a barplot for the number of survivors by 'CabinSection'
plt.figure(figsize=(10, 5))
sns.countplot(x='CabinSection', hue='Survived', data=Key_df)
plt.xlabel('Cabin Section')
plt.ylabel('Number of Passengers')
plt.title('Number of Survivors by Cabin Section')
plt.show()
A majority of the passengers embarked from Southampton, but their survival rate was the lowest of the three ports.
Passengers who embarked from Cherbourg had the highest survival rate, with survivors outnumbering non-survivors.
Queenstown falls in between, with a survival rate of roughly 39%.
Passengers in the Lower class (Pclass = 'Lower') had the lowest survival rate, with far more non-survivors than survivors.
Passengers in the Upper class (Pclass = 'Upper') had the highest survival rate, with more survivors than non-survivors.
The Middle class was more evenly split, suggesting a less pronounced effect on survival than the other classes.
Female passengers had a much higher survival rate than male passengers: the number of female survivors is more than twice the number of non-survivors.
Male passengers had a far lower survival rate, with non-survivors outnumbering survivors by more than four to one.
Age_fig = make_subplots(rows=1,cols=1,specs=[[{"type":"histogram"}]])
Age_fig.add_trace(
go.Histogram(
x = train_data['Age'].where(train_data['Survived']==1),
name = 'Yes',
marker={'color':'white'},
nbinsx=10
),
row=1,col=1
)
Age_fig.add_trace(
go.Histogram(
x = train_data['Age'].where(train_data['Survived']==0),
name = 'No',
marker={'color':'red'},
nbinsx=10
),
row=1,col=1
)
Age_fig.update_layout(title_text="Age distribution by survival", template='plotly_dark', width=500)
Age_fig.show()
# Bin ages into decades (0-9, 10-19, ..., 80-89) and pivot the survival rate for each age group
AgeGroup = pd.cut(train_data['Age'], bins=[0, 9, 19, 29, 39, 49, 59, 69, 79, 89])
train_data.pivot_table(index=AgeGroup, values='Survived', aggfunc='mean').sort_values(by='Survived', ascending=False)
| Age | Survived |
|---|---|
| (79, 89] | 1.000000 |
| (0, 9] | 0.612903 |
| (29, 39] | 0.454054 |
| (49, 59] | 0.416667 |
| (9, 19] | 0.401961 |
| (39, 49] | 0.354545 |
| (59, 69] | 0.315789 |
| (19, 29] | 0.315642 |
| (69, 79] | 0.000000 |
Most passengers are 20-29 years old.
Children (0-9) had a noticeably higher chance of survival, while teenagers and young adults were more likely not to survive.
# Create a new dataframe that drop cabin
Clean_train_data = train_data.drop('Cabin', axis=1).copy()
#Combine SibSp and Parch to get a 'FamilySize' feature
Clean_train_data['FamilySize'] = Clean_train_data['SibSp'] + Clean_train_data['Parch'] + 1
#Create a 'IsAlone' feature based on 'FamilySize':
Clean_train_data['IsAlone'] = Clean_train_data['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
# Show the relationship between family size and survival rate
Family_fig = make_subplots(rows=1,cols=1,specs=[[{"type":"histogram"}]])
Family_fig.add_trace(
go.Histogram(
x = Clean_train_data['FamilySize'].where(Clean_train_data['Survived']==1),
name = 'Yes',
marker={'color':'white'},
nbinsx=20
),
row=1,col=1
)
Family_fig.add_trace(
go.Histogram(
x = Clean_train_data['FamilySize'].where(Clean_train_data['Survived']==0),
name = 'No',
marker={'color':'red'},
nbinsx=20
),
row=1,col=1
)
# Fig Width = 500, Height = 400
Family_fig.update_layout(title_text="Family size distribution by survival", template='plotly_dark', width=500, height=400)
Family_fig.show()
# Count of each family size
Clean_train_data['FamilySize'].value_counts()
FamilySize
1     537
2     161
3     102
4      29
6      22
5      15
7      12
11      7
8       6
Name: count, dtype: int64
# Survival rate of each family size
Clean_train_data.pivot_table(index='FamilySize', values='Survived', aggfunc='mean').sort_values(by='Survived', ascending=False)
| FamilySize | Survived |
|---|---|
| 4 | 0.724138 |
| 3 | 0.578431 |
| 2 | 0.552795 |
| 7 | 0.333333 |
| 1 | 0.303538 |
| 5 | 0.200000 |
| 6 | 0.136364 |
| 8 | 0.000000 |
| 11 | 0.000000 |
Passengers travelling alone had a lower survival rate.
Families of 2-4 people had a higher survival rate.
Families with more than 4 people had a lower survival rate.
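These three regimes suggest an optional bucketed feature. It is not used in the modelling below, but a sketch of how 'FamilySize' could be binned:

```python
import pandas as pd

# Example family sizes spanning the three regimes above
sizes = pd.Series([1, 2, 4, 5, 7, 11])
# Bins: (0, 1] = alone, (1, 4] = small family, (4, 20] = large family
family_type = pd.cut(sizes, bins=[0, 1, 4, 20], labels=['Alone', 'Small', 'Large'])
print(family_type.tolist())  # ['Alone', 'Small', 'Small', 'Large', 'Large', 'Large']
```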
TransData = Clean_train_data.copy()
# convert categorical variables into numerical representations
TransData['Sex'] = TransData['Sex'].map({'male': 0, 'female': 1})
# One-hot encoding for 'Embarked' and 'CabinSection'
TransData = pd.get_dummies(TransData, columns=['Embarked', 'CabinSection'])
TransData.drop(['PassengerId', 'Name', 'Ticket'], axis=1, inplace=True)
df_corr = TransData.corr()
plt.figure(figsize=(20, 15))
sns.heatmap(round(df_corr, 2), annot=True, cmap="mako")
<Axes: >
Clean_test_data = test_data.drop('Cabin', axis=1).copy()
#Combine SibSp and Parch to get a 'FamilySize' feature
Clean_test_data['FamilySize'] = Clean_test_data['SibSp'] + Clean_test_data['Parch'] + 1
#Create a 'IsAlone' feature based on 'FamilySize':
Clean_test_data['IsAlone'] = Clean_test_data['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
# convert categorical variables into numerical representations
Clean_test_data['Sex'] = Clean_test_data['Sex'].map({'male': 0, 'female': 1})
# One-hot encoding for 'Embarked' and 'CabinSection'
Clean_test_data = pd.get_dummies(Clean_test_data, columns=['Embarked', 'CabinSection'])
Clean_test_data['Fare'] = Clean_test_data['Fare'].fillna(Clean_test_data['Fare'].mean())
Clean_test_data.isnull().sum()
PassengerId             0
Pclass                  0
Name                    0
Sex                     0
Age                     0
SibSp                   0
Parch                   0
Ticket                  0
Fare                    0
FamilySize              0
IsAlone                 0
Embarked_C              0
Embarked_Q              0
Embarked_S              0
CabinSection_A          0
CabinSection_B          0
CabinSection_C          0
CabinSection_D          0
CabinSection_E          0
CabinSection_F          0
CabinSection_G          0
CabinSection_Unknown    0
dtype: int64
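Note that `pd.get_dummies` can produce different columns on the two frames: the training set has one passenger with a T cabin, so it gains a `CabinSection_T` column that the test set lacks. One common way to keep the frames aligned is `reindex` with `fill_value=0` — a small sketch on toy data:

```python
import pandas as pd

# Toy frames: the "training" set has a T cabin, the "test" set does not
train_dummies = pd.get_dummies(pd.DataFrame({'CabinSection': ['A', 'B', 'T']}))
test_dummies = pd.get_dummies(pd.DataFrame({'CabinSection': ['A', 'B']}))
# reindex adds any missing dummy columns to the test frame, filled with 0
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(list(test_dummies.columns))
```

In this notebook the mismatch is instead handled by leaving 'CabinSection_T' out of the selected feature list.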
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
# Preprocess the dataset (drop the 'Survived' target variable and non-numeric columns)
X = TransData.drop(['Survived'], axis=1)
y = TransData['Survived']
# Standardize the numeric features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Create a logistic regression model
logreg = LogisticRegression(solver='liblinear')
# Recursive feature elimination with cross-validated selection
rfecv = RFECV(estimator=logreg, step=1, cv=StratifiedKFold(5), scoring='accuracy')
rfecv.fit(X_scaled, y)
print("Optimal number of features: %d" % rfecv.n_features_)
# Plot the number of features versus cross-validation scores
plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross-validation score (accuracy)")
plt.plot(range(1, len(rfecv.cv_results_['mean_test_score']) + 1), rfecv.cv_results_['mean_test_score'])
plt.show()
# Get the selected features
selected_features = X.columns[rfecv.support_]
print("Selected features:", selected_features)
Optimal number of features: 15
Selected features: Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'FamilySize', 'IsAlone',
'Embarked_S', 'CabinSection_C', 'CabinSection_D', 'CabinSection_E',
'CabinSection_F', 'CabinSection_G', 'CabinSection_T',
'CabinSection_Unknown'],
dtype='object')
# 'CabinSection_T' was selected by RFECV but is excluded here: only one training
# passenger has a T cabin and the test set has none, so the column would not align.
Select_Data = TransData[['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'FamilySize', 'IsAlone',
                         'Embarked_S', 'CabinSection_C', 'CabinSection_D', 'CabinSection_E',
                         'CabinSection_F', 'CabinSection_G',
                         'CabinSection_Unknown']].copy()
Test_Select = Clean_test_data[['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'FamilySize', 'IsAlone',
'Embarked_S', 'CabinSection_C', 'CabinSection_D', 'CabinSection_E',
'CabinSection_F', 'CabinSection_G',
'CabinSection_Unknown']].copy()
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron
from sklearn.neural_network import MLPClassifier
#Split the data into training and validation sets
from sklearn.model_selection import train_test_split
X = Select_Data
y = TransData['Survived']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
# Preprocess the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
models = [
('Logistic Regression', LogisticRegression(solver='liblinear')),
('KNN', KNeighborsClassifier()),
('Support Vector Machines', SVC()),
('Naive Bayes', GaussianNB()),
('Decision Tree', DecisionTreeClassifier()),
('Random Forest', RandomForestClassifier(n_estimators=100)),
('Perceptron', Perceptron()),
('Artificial Neural Network', MLPClassifier(max_iter=2000))
]
results = []
names = []
# Scale the selected features once for cross-validation
# (the earlier X_scaled was built from the pre-selection feature matrix)
X_cv_scaled = StandardScaler().fit_transform(X)
for name, model in models:
    cv_scores = cross_val_score(model, X_cv_scaled, y, cv=5, scoring='accuracy')
    results.append(cv_scores)
    names.append(name)
    print(f"{name}: {cv_scores.mean():.3f} ({cv_scores.std():.3f})")
plt.figure(figsize=(10, 5))
plt.boxplot(results)
plt.xticks(range(1, len(names) + 1), names, rotation=45)
plt.ylabel('Accuracy')
plt.title('Model Comparison Using Cross-Validation')
plt.show()
Logistic Regression: 0.799 (0.014)
KNN: 0.785 (0.024)
Support Vector Machines: 0.800 (0.019)
Naive Bayes: 0.468 (0.115)
Decision Tree: 0.787 (0.045)
Random Forest: 0.801 (0.028)
Perceptron: 0.749 (0.047)
Artificial Neural Network: 0.802 (0.044)
The ANN model has the highest mean accuracy (0.802) among the compared models, but also a relatively high standard deviation (0.044).
The Support Vector Machines (SVM) model also looks like a good choice, with comparable accuracy and a lower standard deviation.
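Before committing to the ANN, its hidden-layer size and regularization strength could also be tuned. A minimal `GridSearchCV` sketch on toy data (the grid shown is illustrative; the notebook itself uses `MLPClassifier` defaults):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Toy data standing in for the scaled Titanic features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(150, 4))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Illustrative parameter grid, cross-validated over 3 folds
grid = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={'hidden_layer_sizes': [(50,), (100,)], 'alpha': [1e-4, 1e-2]},
    cv=3,
    scoring='accuracy',
)
grid.fit(X_toy, y_toy)
print(grid.best_params_, round(grid.best_score_, 3))
```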
# Prepare the final training set from the selected features and the target
X_train = Select_Data
y_train = TransData['Survived']
# Standardize the numeric features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
# Create and fit the ANN model
ann = MLPClassifier(max_iter=1000)
ann.fit(X_train_scaled, y_train)
MLPClassifier(max_iter=1000)
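The train/validation split created earlier is never actually scored. A hedged sketch of how held-out accuracy could be checked before the final fit (toy data in place of the scaled features):

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy data standing in for the scaled training features
rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

# Hold out 20% for validation, fit on the rest, then score the held-out split
X_tr, X_val, y_tr, y_val = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
model = MLPClassifier(max_iter=1000, random_state=42).fit(X_tr, y_tr)
acc = accuracy_score(y_val, model.predict(X_val))
print(f"validation accuracy: {acc:.2f}")
```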
# Preprocess the test dataset
X_test = Test_Select
# Standardize the numeric features using the same scaler fitted on the training data
X_test_scaled = scaler.transform(X_test)
# Predict the survival probabilities for the test dataset
y_test_proba = ann.predict_proba(X_test_scaled)
# Predict the survival outcomes for the test dataset
y_test_pred = ann.predict(X_test_scaled)
output = pd.DataFrame({'PassengerId': test_data['PassengerId'], 'Survived': y_test_pred})
output.to_csv('predictions.csv', index=False)